A number of recent works have studied the effectiveness of feature selection using Lasso. It is
known that under the restricted isometry property (RIP), Lasso does not generally lead to the
exact recovery of the set of nonzero coefficients, due to the looseness of convex relaxation. This
paper considers the feature selection property of nonconvex regularization, where the solution is
given by a multi-stage convex relaxation scheme. Under appropriate conditions, we show that
the local solution obtained by this procedure recovers the set of nonzero coefficients without
suffering from the bias of Lasso relaxation, which complements parameter estimation results of
this procedure in [16].

1 Introduction

We consider the linear regression problem, where we observe a set of input vectors $x_1, \ldots, x_n \in R^p$,
with corresponding desired output variables $y_1, \ldots, y_n$. In a statistical linear model, it is common
to assume that there exists a target coefficient vector $\bar{w} \in R^p$ such that

$$y_i = \bar{w}^\top x_i + \epsilon_i \quad (i = 1, \ldots, n), \qquad (1)$$

where $\epsilon_i$ are zero-mean independent random noises (but not necessarily identically distributed).
Moreover, we assume that the target vector $\bar{w}$ is sparse. That is, $k = \|\bar{w}\|_0$ is small. Here we use
the standard notation

$$\mathrm{supp}(w) = \{j : w_j \neq 0\}, \qquad \|w\|_0 = |\mathrm{supp}(w)|$$

for any vector $w \in R^p$.

This paper focuses on the feature selection problem, where we are interested in estimating the
set of nonzero coefficients $\mathrm{supp}(\bar{w})$ (also called the support set). Let $y$ denote the vector of $[y_i]$ and $X$
be the $n \times d$ matrix with each row a vector $x_i^\top$. The standard statistical method is subset selection
($L_0$ regularization), which computes the following estimator

$$\hat{w}_{L_0} = \arg\min_{w} \|Xw - y\|_2^2 \quad \text{subject to } \|w\|_0 \le k, \qquad (2)$$

where $k$ is a tuning parameter. This method is arguably a natural method for feature selection
because if the noise terms $\epsilon_i$ are iid Gaussian random variables, then (2) can be regarded as a Bayes procedure
with an appropriately defined sparse prior over $w$. However, because the optimization problem in
(2) is nonconvex, the global solution of this problem cannot be efficiently computed. In practice, one
can only find an approximate solution of (2). The most popular approximation to $L_0$ regularization
is the $L_1$ regularization method, which is often referred to as Lasso [9]:

$$\hat{w}_{L_1} = \arg\min_{w} \left[ \frac{1}{n} \|Xw - y\|_2^2 + \lambda \|w\|_1 \right], \qquad (3)$$

where $\lambda > 0$ is an appropriately chosen regularization parameter.

The global optimum of (3) can be easily computed using standard convex programming techniques.
It is known that in practice, $L_1$ regularization often leads to sparse solutions (although
often suboptimal). Moreover, its performance has been theoretically analyzed recently. For example,
it is known from the compressed sensing literature (e.g., [3]) that under certain conditions
referred to as the restricted isometry property (RIP), the solution of the $L_1$ relaxation (3) approximates
the solution of the $L_0$ regularization problem (2). The prediction and parameter estimation performance of
this method has been considered in [21 [H El [HJ El [TO] • Exact support recovery was considered by 
various authors such as [H [T8l [TT] . It is known that under some more restrictive conditions referred 
to as irrepresentable conditions, $L_1$ regularization can achieve exact recovery of the support set.
However, the $L_1$ regularization method (3) does not achieve exact recovery of the support set under
the RIP type of conditions, which we are interested in here.

Although it is possible to achieve exact recovery using post-processing by thresholding the
small coefficients of the Lasso solution, this method is suboptimal under RIP in comparison to the
$L_0$ regularization method (2), because it requires the smallest nonzero coefficients to be $\sqrt{k}$ times
larger than the noise level, whereas $L_0$ regularization in (2) only requires the nonzero coefficients
to be larger than the noise level. This issue, referred to as the bias of Lasso for feature
selection, was extensively discussed in [13]. Detailed discussion can be found after Theorem 1. It
is worth mentioning that under a stronger mutual coherence condition (similar to the irrepresentable
condition), this post-processing step does not incur the bias factor $\sqrt{k}$, as shown in [7] (also see
[15]). Therefore the advantage of bias removal for the multi-stage procedure discussed here
applies only when RIP holds but the irrepresentable condition and the mutual incoherence
condition fail. A thorough discussion of various conditions is beyond the scope of the current paper,
and we refer the readers to [10]. Nevertheless, it is worth pointing out that even in the
classical $d < n$ setting with the design matrix $X$ being rank $d$, the irrepresentable condition or the
mutual incoherence condition can still be violated while the RIP type sparse-eigenvalue condition
used in this paper holds trivially. In fact, this was pointed out in [10] as the main motivation of
adaptive Lasso. Adaptive Lasso behaves similarly to the above mentioned post-processing, and
thus suffers from the same bias problem.

The bias of Lasso is due to the looseness of convex relaxation for $L_0$ regularization. Therefore
the remedy is to use a nonconvex regularizer that is close to $L_0$ regularization. One drawback
of using a nonconvex optimization formulation is that we can only find a local optimal solution, and
different computational procedures may lead to different local solutions. Therefore the theoretical
analysis has to be integrated with a specific computational procedure to show that the local minimum
obtained by the procedure has desirable properties (e.g., exact support recovery). Several nonconvex
computational procedures have been analyzed in the literature, including an adaptive forward
backward greedy procedure (referred to as FoBa), considered in [17], to approximately solve the
$L_0$ regularization method (2), and the MC+ method in [13] to solve a nonconvex regularized problem
using a path-following procedure. Both methods can achieve unbiased feature selection.

Related to the above mentioned work, a different procedure, referred to as multi-stage convex
relaxation, was analyzed in [16]. This procedure solves a nonconvex problem using multiple stages
of Lasso relaxations, where convex formulations are iteratively refined based on solutions obtained
from the previous stages. However, only parameter estimation performance was analyzed in [16].
Unfortunately, the result in [16] does not directly imply that multi-stage convex relaxation achieves
unbiased recovery of the support set. The purpose of this paper is to prove such a support recovery
result, analogous to the related result in [13] (which is for a different procedure), and this result
complements the parameter estimation result of [16].

2 Multi-Stage Convex Relaxation with Capped-$L_1$ Regularization

We are interested in recovering $\bar{w}$ from noisy observations $y$ using the following nonconvex
regularization formulation:

$$\hat{w} = \arg\min_{w} \left[ \frac{1}{n} \|Xw - y\|_2^2 + \lambda \sum_{j=1}^{d} g(|w_j|) \right], \qquad (4)$$

where $g(|w_j|)$ is a regularization function. For simplicity, this paper only considers the specific
regularizer

$$g(|w_j|) = \min(|w_j|, \theta), \qquad (5)$$

which is referred to as capped-$L_1$ regularization in [16]. The parameter $\theta$ is a thresholding parameter,
which says that we use $L_1$ penalization when a coefficient is sufficiently small, but the penalty does
not increase when the coefficient is larger than the threshold $\theta$. Detailed discussions can be found in
[16]. Similar to [16], one can analyze a general regularization function $g(u)$. However, some of such
functions (such as adaptive Lasso) do not completely remove the bias. Therefore we only analyze
the simple function (5) in this paper for clarity. While a theoretical justification has been given in
[16] for multi-stage convex relaxation, similar procedures have been shown to work well empirically
without theoretical justification [4, 12]. Moreover, a two-stage version was proposed in [20], which
does not remove the bias issue discussed in this paper.

Since the regularizer (5) is nonconvex, the resulting optimization problem (4) is a nonconvex
regularization problem. However, the regularizer in (5) is continuous and piecewise differentiable,
and thus its solution is easier to compute than that of the $L_0$ regularization method in (2). For example,
standard numerical techniques such as sub-gradient descent lead to local minimum solutions.
Unfortunately, it is difficult to find the global optimum, and it is also difficult to analyze the quality
of the local minimum obtained from the sub-gradient descent method. As a matter of fact, results
with nonconvex regularization are difficult to reproduce because different numerical optimization
procedures can lead to different local minima. Therefore the quality of the solution depends heavily
on the numerical procedure used.

In the following, we consider a specific numerical procedure referred to as multi-stage convex
relaxation in [16]. The algorithm is given in Figure 1. The procedure converges to a local optimal
solution of (4) due to a simple concave duality argument, where (4) is rewritten as

$$\hat{w} = \arg\min_{w} \min_{\{\lambda_j \ge 0\}} \left[ \frac{1}{n} \|Xw - y\|_2^2 + \sum_{j=1}^{d} \lambda_j |w_j| + \sum_{j=1}^{d} g^*(\lambda_j) \right],$$

with $g^*(\lambda_j) = \max((\lambda - \lambda_j)\theta, 0)$. The procedure of Figure 1 can be regarded as an alternating
optimization method to solve this joint optimization problem of $w$ and $\{\lambda_j\}$, where the first step
solves for $w$ with $\{\lambda_j\}$ fixed, and the second step is the closed form solution of $\{\lambda_j\}$ with $w$ fixed. A
more detailed discussion can be found in [16]. Our goal is to show that this procedure can achieve
unbiased feature selection as described in [15].
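
The closed form of the second step can be checked directly. The following worked step is a sketch, assuming the capped-$L_1$ regularizer $g(u) = \min(u, \theta)$ and the dual term $g^*(\lambda_j) = \max((\lambda - \lambda_j)\theta, 0)$; the shorthand $u = |w_j|$ and the function name $h$ are introduced here only for illustration.

```latex
% For fixed u = |w_j| >= 0, minimize over lambda_j >= 0:
%   h(lambda_j) = lambda_j * u + max((lambda - lambda_j) * theta, 0).
\begin{align*}
h(\lambda_j) &= \lambda\theta + \lambda_j (u - \theta) && \text{for } 0 \le \lambda_j \le \lambda, \\
h(\lambda_j) &= \lambda_j u \ \ge\ \lambda u && \text{for } \lambda_j \ge \lambda .
\end{align*}
% On [0, lambda] the function is linear in lambda_j, so the minimum sits at an endpoint:
\begin{align*}
u < \theta &: \quad \lambda_j = \lambda, \quad h = \lambda u, \\
u > \theta &: \quad \lambda_j = 0, \quad h = \lambda \theta .
\end{align*}
% Hence the inner minimization recovers the capped-L1 penalty:
\[
\min_{\lambda_j \ge 0} \big[ \lambda_j u + g^*(\lambda_j) \big] = \lambda \min(u, \theta),
\qquad \lambda_j = \lambda\, I(u \le \theta) \text{ is a minimizer.}
\]
```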



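Initialize $\lambda_j^{(0)} = \lambda$ for $j = 1, \ldots, d$

For $\ell = 1, 2, \ldots$

  • Let

    $$\hat{w}^{(\ell)} = \arg\min_{w \in R^p} \left[ \frac{1}{n} \|Xw - y\|_2^2 + \sum_{j=1}^{d} \lambda_j^{(\ell-1)} |w_j| \right] \qquad (6)$$

  • Let $\lambda_j^{(\ell)} = \lambda I(|\hat{w}_j^{(\ell)}| \le \theta)$ $(j = 1, \ldots, d)$

Figure 1: Multi-stage Convex Relaxation for Sparse Regularization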
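
The two steps of the procedure can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used for the experiments: the function names `weighted_lasso` and `multi_stage_capped_l1`, the use of plain coordinate descent as the inner convex solver, the fixed sweep count, and the small synthetic problem are all assumptions made here. The inner objective is taken to be $(1/n)\|Xw - y\|_2^2 + \sum_j \lambda_j |w_j|$, which matches the first-order condition used in the proofs.

```python
import numpy as np


def weighted_lasso(X, y, lam, n_sweeps=200):
    """Coordinate descent for min_w (1/n)||Xw - y||^2 + sum_j lam[j] * |w_j|."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) ||X_j||^2 for each column
    resid = y.astype(float).copy()         # residual y - Xw (w starts at zero)
    for _ in range(n_sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * w[j]        # remove coordinate j from the fit
            c = X[:, j] @ resid / n
            # soft-threshold at lam[j] / 2, then rescale by the column norm
            w[j] = np.sign(c) * max(abs(c) - lam[j] / 2.0, 0.0) / col_sq[j]
            resid -= X[:, j] * w[j]
    return w


def multi_stage_capped_l1(X, y, lam, theta, n_stages=5):
    """Re-weighted Lasso stages with lambda_j = lam * I(|w_j| <= theta)."""
    d = X.shape[1]
    weights = np.full(d, float(lam))       # stage 1 is plain Lasso
    history = []
    for _ in range(n_stages):
        w = weighted_lasso(X, y, weights)
        history.append(w)
        weights = lam * (np.abs(w) <= theta)   # drop the penalty on large coefficients
    return history


# Small synthetic check (hypothetical sizes, not the paper's simulation setting).
rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = 5.0
y = X @ w_true + 0.1 * rng.standard_normal(n)
stages = multi_stage_capped_l1(X, y, lam=0.5, theta=1.0, n_stages=3)
print(np.flatnonzero(stages[-1]))
```

In this sketch the first stage is the ordinary Lasso, whose nonzero coefficients are shrunk by roughly $\lambda/2$; later stages remove the penalty on coefficients above $\theta$, so the estimate on the support moves toward the least squares fit.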



3 Theoretical Analysis 

We require some technical conditions for our analysis. First we assume sub-Gaussian noise as 
follows. 

Assumption 1 Assume that $\{\epsilon_i\}_{i=1,\ldots,n}$ in (1) are independent (but not necessarily identically
distributed) sub-Gaussians: there exists $\sigma \ge 0$ such that $\forall i$ and $\forall t \in R$,

$$E_{\epsilon_i} e^{t \epsilon_i} \le e^{\sigma^2 t^2 / 2}.$$

Both Gaussian and bounded random variables are sub-Gaussian under the above definition. For
example, if a random variable $\xi \in [a, b]$, then $E_\xi e^{t(\xi - E\xi)} \le e^{(b-a)^2 t^2 / 8}$. If a random variable is
Gaussian, $\xi \sim N(0, \sigma^2)$, then $E_\xi e^{t\xi} \le e^{\sigma^2 t^2 / 2}$.

We also introduce the concept of sparse eigenvalue, which is standard in the analysis of $L_1$
regularization.

Definition 1 Given $k$, define

$$\rho_+(k) = \sup\left\{ \frac{1}{n} \|Xw\|_2^2 / \|w\|_2^2 : \|w\|_0 \le k \right\},$$

$$\rho_-(k) = \inf\left\{ \frac{1}{n} \|Xw\|_2^2 / \|w\|_2^2 : \|w\|_0 \le k \right\}.$$
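
For very small problems the sparse eigenvalues can be computed by enumeration, which may help build intuition for the definition. The sketch below is only an illustration: the function name `sparse_eigenvalues` is hypothetical, and the brute-force search over all supports of size $k$ is feasible only when the number of columns is small. Searching supports of size exactly $k$ is enough, because any vector with fewer nonzeros lies in the span of some size-$k$ support.

```python
import itertools
import numpy as np


def sparse_eigenvalues(X, k):
    """Brute-force rho_+(k) and rho_-(k) over all supports of size k.

    Only practical for small d, since the number of supports grows combinatorially.
    """
    n, d = X.shape
    A = X.T @ X / n
    rho_plus, rho_minus = -np.inf, np.inf
    for support in itertools.combinations(range(d), k):
        idx = np.array(support)
        eig = np.linalg.eigvalsh(A[np.ix_(idx, idx)])  # eigenvalues in ascending order
        rho_plus = max(rho_plus, eig[-1])
        rho_minus = min(rho_minus, eig[0])
    return rho_plus, rho_minus


# Orthogonal design with X^T X / n = I: both sparse eigenvalues equal 1.
X_orth = np.sqrt(4) * np.eye(4)
print(sparse_eigenvalues(X_orth, 2))

# Two identical columns: rho_-(2) drops to 0 while rho_+(2) rises to 2.
X_dup = np.array([[np.sqrt(2.0), np.sqrt(2.0)], [0.0, 0.0]])
print(sparse_eigenvalues(X_dup, 2))
```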

The following result for parameter estimation was obtained in [16], under Assumption 1. Assume
that the target $\bar{w}$ is sparse, with $E y_i = \bar{w}^\top x_i$ and $k = \|\bar{w}\|_0$, and that we choose $\theta$ and $\lambda$
such that

$$\lambda \ge 20\sigma\sqrt{2\rho_+(1)\ln(2p/\eta)/n}$$

and

$$\theta \ge 9\lambda/\rho_-(2k + s).$$

Assume also that $\rho_+(s)/\rho_-(2k + 2s) \le 1 + 0.5s/k$ for some $s \ge 2k$. Then with probability larger than
$1 - \eta$:

$$\|\hat{w}^{(\ell)} - \bar{w}\|_2 \le \frac{17}{\rho_-(2k+s)} \Big[\, \cdots \,\Big] + 0.7^{\ell} \, \frac{\cdots}{\rho_-(2k+s)} \, \sqrt{k}\lambda, \qquad (7)$$

where $\hat{w}^{(\ell)}$ is the solution of (6), and $k_\theta = |\{j \in F : |\bar{w}_j| < 2\theta\}|$.

The condition $\rho_+(s)/\rho_-(2k + 2s) \le 1 + 0.5s/k$ requires the eigenvalue ratio $\rho_+(s)/\rho_-(s)$ to grow
sub-linearly in $s$. Such a condition, referred to as the sparse eigenvalue condition, is also needed in
the standard analysis of $L_1$ regularization [14, 15]. It is related to but slightly weaker than the RIP
condition in compressive sensing [3], which requires the condition

$$1 - \delta_{s'} \le \rho_-(s') \le \rho_+(s') \le 1 + \delta_{s'}$$

for some $\delta_{s'} \in (0, 1)$ and $s' > k$. For example, with $s' = 6k$ and the restricted isometry constant
$\delta_{s'} \le 1/3$, the sparse eigenvalue condition above holds with $s = 2k$. For simplicity, in this
paper we do not make distinctions between RIP and the sparse eigenvalue condition. Note that in
the traditional low-dimensional statistical analysis, one assumes that $\rho_+(s)/\rho_-(2k + 2s) < \infty$ as
$s \to \infty$, which is significantly stronger than the condition we use here. Although in practice it is
often difficult to verify the sparse eigenvalue condition for real problems, the parameter estimation
result in (7) nevertheless provides important theoretical insights for multi-stage convex relaxation.
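
The example with $s' = 6k$ can be verified in one line. The following worked step is a sketch under the stated assumptions ($\delta_{6k} \le 1/3$ and $s = 2k$), using only the fact that $\rho_+$ is nondecreasing in its argument.

```latex
\[
\frac{\rho_+(s)}{\rho_-(2k+2s)}
  = \frac{\rho_+(2k)}{\rho_-(6k)}
  \le \frac{\rho_+(6k)}{\rho_-(6k)}
  \le \frac{1+\delta_{6k}}{1-\delta_{6k}}
  \le \frac{1+1/3}{1-1/3}
  = 2
  = 1 + 0.5\,\frac{s}{k} .
\]
```
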
For standard Lasso, we have the following bound

$$\|\hat{w}_{L_1} - \bar{w}\|_2 = O(\sqrt{k}\lambda),$$

where $\hat{w}_{L_1}$ is the solution of the standard $L_1$ regularization. This bound is tight for Lasso, in
the sense that the right hand side cannot be improved except for the constant; this can be easily
verified with an orthogonal design matrix. It is known that in order for Lasso to be effective, one
has to pick $\lambda$ no smaller than the order $\sigma\sqrt{\ln p/n}$. Therefore, the parameter estimation error of
the standard Lasso is of the order $\sigma\sqrt{k \ln p/n}$, which cannot be improved.
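
The tightness claim for an orthogonal design can be illustrated numerically. The sketch below is an assumption-laden illustration, not part of the original analysis: it takes the Lasso objective to be $(1/n)\|Xw - y\|_2^2 + \lambda\|w\|_1$, builds a design with $X^\top X / n = I$, and uses noiseless observations so that the only error is the shrinkage. The function name `lasso_orthogonal` and the problem sizes are hypothetical.

```python
import numpy as np


def lasso_orthogonal(X, y, lam):
    """Closed-form Lasso solution of (1/n)||Xw - y||^2 + lam * ||w||_1
    when X^T X / n is the identity: soft-threshold X^T y / n at lam / 2."""
    n = X.shape[0]
    z = X.T @ y / n
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)


rng = np.random.default_rng(0)
n, d, k, lam = 50, 10, 4, 0.2
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))  # Q has orthonormal columns
X = np.sqrt(n) * Q                                # so that X^T X / n = I
w_bar = np.zeros(d)
w_bar[:k] = 1.0                                   # all nonzeros exceed lam / 2
y = X @ w_bar                                     # noiseless observations
w_hat = lasso_orthogonal(X, y, lam)
err = np.linalg.norm(w_hat - w_bar)
print(err, np.sqrt(k) * lam / 2.0)
```

Each nonzero coefficient is shrunk by exactly $\lambda/2$, so the 2-norm error equals $\sqrt{k}\lambda/2$, which is of the order $\sqrt{k}\lambda$ regardless of how large the true coefficients are.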

In comparison, if we consider the capped-$L_1$ regularization with $g(|w_j|)$ defined in (5), the
bound in (7) can be significantly better when most nonzero coefficients of $\bar{w}$ are relatively large in
magnitude. In the extreme case where $k_\theta = |\{j : |\bar{w}_j| \in (0, 2\theta]\}| = 0$, which can be achieved when
all nonzero components of $\bar{w}$ are larger than the order $\sigma\sqrt{\ln p/n}$, we obtain the following better
bound

$$\|\hat{w}^{(\ell)} - \bar{w}\|_2 = O\left(\sqrt{k/n} + \sqrt{\ln(1/\eta)/n}\right)$$

for the multi-stage procedure for a sufficiently large $\ell$ at the order of $\ln k + \ln\ln p$. This bound is
superior to the standard one-stage $L_1$ regularization bound $\|\hat{w}_{L_1} - \bar{w}\|_2 = O(\sqrt{k \ln(p/\eta)/n})$.

In the literature, one is often interested in two types of results: one is a parameter estimation
bound as in (7), and the other is feature selection consistency, that is, identifying the set of nonzero
coefficients of the truth. Although the parameter estimation bound in (7) is superior to that of Lasso, the
result does not imply that one can correctly select all variables under this condition. Moreover, the
specific proof presented in [16] does not directly imply such a result. Therefore it is important to
know whether the multi-stage convex relaxation can achieve unbiased feature selection as studied
in [13]. In the following, we present such a result, which supplements the parameter estimation
bound of (7). While the main high-level argument follows that of [16], there are many differences
in the details, and hence a full proof (which is included in Section 5) is still needed. This theorem
is the main result of the paper. It is worth mentioning that although we only consider the simple
capped-$L_1$ regularizer, similar results can be obtained for other regularizers (with virtually the
same proof) such that $g'(u) \in [0, \infty)$, $g'(u) > 0$ when $u$ belongs to a neighborhood of 0, and $g'(u) = 0$
when $u > \theta$, with a threshold $\theta > 0$ appropriately chosen at the order of the noise level. The
condition $g'(u) = 0$ when $u > \theta$ ensures the removal of the feature selection "bias" of Lasso which we
discussed above. As an example, a very similar result can be obtained for the MC+ penalty of [13]
or the SCAD penalty of [5] using the multi-stage convex relaxation procedure here. In fact, in practice
there may be additional advantages of using a smooth nonconvex penalty such as MC+ due to the
extra smoothness, although such an advantage is not revealed in our theoretical analysis.

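Theorem 1 Let Assumption 1 hold. Assume also that the target $\bar{w}$ is sparse, with $E y_i = \bar{w}^\top x_i$
and $k = \|\bar{w}\|_0$. Let $F = \mathrm{supp}(\bar{w})$. Choose $\theta$ and $\lambda$ such that

$$\lambda \ge 7\sigma\sqrt{2\rho_+(1)\ln(2p/\eta)/n}$$

and

$$\theta \ge 9\lambda/\rho_-(1.5k + s).$$

Assume that

$$\min_{j \in F} |\bar{w}_j| > 2\theta$$

and $\rho_+(s)/\rho_-(1.5k + 2s) \le 1 + 2s/(3k)$ for some $s \ge 1.5k$. Then with probability larger than $1 - \eta$:

$$\mathrm{supp}(\hat{w}^{(\ell)}) = \mathrm{supp}(\bar{w})$$

when $\ell > L$, where $\hat{w}^{(\ell)}$ is the solution of (6) and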

$$L = \left\lfloor \frac{0.5 \ln k}{\ln(\rho_-(1.5k+s)\theta/(6\lambda))} \right\rfloor + 1.$$

Theorem 1 is the main result of this paper. If

$$\min_{j \in F} |\bar{w}_j| \ge c\sigma\sqrt{\ln p/n}$$

for a sufficiently large constant $c$ that is independent of $k$ (but could depend on the RIP condition),
then we can pick both parameters $\lambda = O(\sigma\sqrt{\ln p/n})$ and $\theta = O(\sigma\sqrt{\ln p/n})$ at the noise level,
so that Theorem 1 can be applied. In this case, Theorem 1 implies that multi-stage capped-$L_1$
regularization achieves exact recovery of the support set $\mathrm{supp}(\bar{w})$. In comparison, Lasso does not
achieve exact sparse recovery under RIP conditions. While running Lasso followed by thresholding
small coefficients to zero (or using the adaptive Lasso of [19] or the two-stage procedure of [20]) may
achieve exact recovery, such a procedure requires the condition that

$$\min_{j \in F} |\bar{w}_j| \ge c'\sigma\sqrt{k \ln p/n} \qquad (8)$$

for some constant $c'$ (which also depends on the RIP condition). This extra $\sqrt{k}$ factor is referred to as
the bias of the Lasso procedure in [13]. Moreover, it is known that for exact recovery to hold,
the requirement of $\min_{j \in F} |\bar{w}_j| \ge c\sigma\sqrt{\ln p/n}$ (up to a constant) is necessary for all statistical
procedures, in the sense that if $\min_{j \in F} |\bar{w}_j| \le c'\sigma\sqrt{\ln p/n}$ for a sufficiently small constant $c'$
(under appropriate RIP conditions), then no statistical procedure can achieve exact recovery with
large probability. Therefore statistical procedures that can achieve exact support recovery under
this weaker requirement, that is, without the extra $\sqrt{k}$ factor in (8), are referred to as (nearly)
unbiased feature selection methods in [13]. Theorem 1 shows that
multi-stage convex relaxation with capped-$L_1$ regularization achieves unbiased feature selection.

Results most comparable to what we have obtained here are those of the FoBa procedure in
[17] and those of the MC+ procedure in [13]. Both can be regarded as (approximate) optimization
methods for nonconvex formulations. The former is a forward backward greedy algorithm, which
does not optimize (4), while the latter is a path-following algorithm for solving formulations similar
to (4). Although results in [13] are comparable to ours, we should note that unlike our procedure,
which is efficient due to the finite number of convex optimizations, there is no proof showing that
the path-following strategy in [13] is always efficient (in the sense that there may be exponentially
many switching points).

4 Simulation Study 

Numerical examples can be found in [16] that demonstrate the advantage of multi-stage convex
relaxation over Lasso. Therefore we shall not repeat a comprehensive study. Nevertheless, this
section presents a simple simulation study to illustrate the theoretical results. The $n \times p$ design
matrix $X$ is generated with iid random Gaussian entries, and each column is normalized to have 2-norm
$\sqrt{n}$. Here $n = 100$ and $p = 250$. We then generate a vector $\bar{w}$ with $k = 30$ nonzero coefficients,
and each nonzero coefficient is uniformly generated from the interval $(1, 10)$. The observation is
$y = X\bar{w} + \epsilon$, where $\epsilon$ is zero-mean iid Gaussian noise with standard deviation $\sigma = 1$. We study the
feature selection performance of the multi-stage convex relaxation method in Figure 1 using various
configurations of $\lambda = \tau\sigma\sqrt{\ln(p)/n}$ (with $\tau = 1, 2, 4, 8, 16, 32$), and $\theta = \mu\lambda$ for various constants
$\mu = 0.5, 1, 2, 4$.

The experiments are repeated 100 times, and Table 1 reports the probability (percentage in
the 100 runs) of exact support recovery for each configuration at various stages $\ell$. Note that $\ell = 1$
corresponds to Lasso and $\ell = 2$ is an adaptive-Lasso-like two-stage method [19, 20]. The main
purpose of this study is to illustrate that it is beneficial to use more than two stages, as predicted
by our theory. However, since only $O(\ln(k))$ stages are sufficient, optimal results can be achieved with
a relatively small number of stages. These conclusions can be clearly seen from Table 1. Specifically,
the results for $\ell = 2$ are better than those of $\ell = 1$ (standard Lasso), while the results of $\ell = 4$ are
better than those of $\ell = 2$. Although the performance of $\ell = 8$ is even better, the improvement over
$\ell = 4$ is small at the optimal configuration of $\lambda$ and $\theta$. This is consistent with our theory, which
implies that a relatively small number of stages is needed to achieve good performance.
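
The data generation and parameter grid described above can be sketched as follows. This is a hedged illustration of the setup only: the helper names `make_problem`, `parameter_grid`, and `exact_recovery` are hypothetical, the random seed is arbitrary, and choosing the support uniformly at random is an assumption, since the text does not say how the positions of the nonzero coefficients are selected.

```python
import numpy as np


def make_problem(n=100, p=250, k=30, sigma=1.0, seed=0):
    """Generate one synthetic instance following the simulation description."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)     # each column gets 2-norm sqrt(n)
    w_bar = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)  # assumption: uniformly random support
    w_bar[support] = rng.uniform(1.0, 10.0, size=k)
    y = X @ w_bar + sigma * rng.standard_normal(n)
    return X, y, w_bar


def parameter_grid(n, p, sigma=1.0):
    """All (lambda, theta) pairs with lambda = tau*sigma*sqrt(ln(p)/n) and theta = mu*lambda."""
    base = sigma * np.sqrt(np.log(p) / n)
    return [(tau * base, mu * tau * base)
            for tau in (1, 2, 4, 8, 16, 32)
            for mu in (0.5, 1, 2, 4)]


def exact_recovery(w_hat, w_bar):
    """True when the estimated support equals the true support."""
    return bool(np.array_equal(w_hat != 0, w_bar != 0))


X, y, w_bar = make_problem()
grid = parameter_grid(n=100, p=250)
print(len(grid), X.shape, int(np.count_nonzero(w_bar)))
```

Any solver for the stages of Figure 1 can then be run on each `(lambda, theta)` pair, and `exact_recovery` averaged over repeated draws gives the kind of recovery frequency reported in Table 1.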

5 Proof of Theorem 1

The analysis is an adaptation of [16]. While the main proof structure is similar, there are nevertheless
subtle and important differences in the details, and hence a complete proof is still necessary.
The main technical differences are as follows. The proof of [16] tracks the progress from one stage
$\ell - 1$ to the next stage $\ell$ using a 2-norm bound on the parameter estimate, while in the current proof
we track the progress using the set of variables that differ significantly from the true variables.
Moreover, in [16], we compare the current estimated parameter to the true parameter $\bar{w}$, which is
sufficient for parameter estimation. However, in order to establish the feature selection result of this
paper, it is necessary to compare the current estimated parameter to the least squares solution $\tilde{w}$
within the true feature set $F$ as defined below in (9). These subtle technical differences mean that
many details in the proofs presented below differ from those of [16].

Table 1: Probability of Exact Support Recovery for Multi-stage Convex Relaxation

5.1 Auxiliary lemmas 

We first introduce some definitions. Consider the positive semi-definite matrix $A = n^{-1} X^\top X \in R^{d \times d}$.
Given $s, k \ge 1$ such that $s + k \le d$, let $I, J$ be disjoint subsets of $\{1, \ldots, d\}$ with $k$ and
$s$ elements respectively. Let $A_{I,I} \in R^{k \times k}$ be the restriction of $A$ to indices $I$, and $A_{I,J} \in R^{k \times s}$ be the
restriction of $A$ to indices $I$ on the left and $J$ on the right. Similarly we define the restriction $w_I$ of a
vector $w \in R^p$ on $I$; and for convenience, we allow either $w_I \in R^k$ or $w_I \in R^p$ (where components
not in $I$ are zeros) depending on the context.

We also need the following quantity in our analysis:

$$\pi(k, s) = \sup_{v \in R^k,\, u \in R^s,\, I,\, J} \frac{v^\top A_{I,J} u \, \|v\|_2}{v^\top A_{I,I} v \, \|u\|_\infty}.$$

The following two lemmas are taken from [15]. We skip the proofs.

Lemma 1 The following inequality holds:

$$\pi(k, s) \le \frac{s^{1/2}}{2} \sqrt{\rho_+(s)/\rho_-(k+s) - 1}.$$

Lemma 2 Consider $k, s > 0$ and $G \subset \{1, \ldots, d\}$ such that $|G^c| = k$. Given any $w \in R^p$, let $J$ be
the indices of the $s$ largest components of $w_G$ (in absolute values), and $I = G^c \cup J$. Then

$$\max(0, w_I^\top A w) \ge \rho_-(k+s) \left( \|w_I\|_2 - \pi(k+s, s) \|w_G\|_1 / s \right) \|w_I\|_2.$$

Our analysis requires us to keep track of progress with respect to the least squares solution $\tilde{w}$
with the true feature set $F$, which we define below:

$$\tilde{w} = \arg\min_{w} \|Xw - y\|_2^2 \quad \text{subject to } \mathrm{supp}(w) \subset F, \qquad (9)$$

where $F = \mathrm{supp}(\bar{w})$.

The following lemmas require varying degrees of modification from similar lemmas in [16], and
thus the proofs are included for completeness.

Lemma 3 Define $\hat{\epsilon} = \frac{1}{n} X^\top (X\tilde{w} - y)$. Under the conditions of Assumption 1, with probability
larger than $1 - \eta$:

$$\forall j \in F: \quad \hat{\epsilon}_j = 0, \quad |\tilde{w}_j - \bar{w}_j| \le \sigma\sqrt{2\rho_-(k)^{-1}\ln(2p/\eta)/n},$$

and

$$\forall j \notin F: \quad |\hat{\epsilon}_j| \le \sigma\sqrt{2\rho_+(1)\ln(2p/\eta)/n}.$$

Proof Let $P$ be the projection matrix onto the subspace spanned by the columns of $X$ in $F$. Then we
know that

$$X\tilde{w} = Py$$

and

$$(I - P)Ey = Ey - X\bar{w} = 0.$$

Therefore for each $j$

$$n|\hat{\epsilon}_j| = |X_j^\top (X\tilde{w} - y)| = |X_j^\top (I - P)(y - Ey)|.$$

It implies that $\hat{\epsilon}_j = 0$ if $j \in F$. Since for each $j$ the column $X_j$ satisfies $\|(I - P)X_j\|_2^2 \le n\rho_+(1)$,
we have from the sub-Gaussian tail bound that for all $j \notin F$ and $\epsilon > 0$:

$$P[|\hat{\epsilon}_j| \ge \epsilon] \le 2\exp[-n\epsilon^2/(2\sigma^2\rho_+(1))].$$

Moreover, for each $j \in F$, we have

$$|\tilde{w}_j - \bar{w}_j| = |e_j^\top (X_F^\top X_F)^{-1} X_F^\top (y - Ey)|.$$

Since $\|e_j^\top (X_F^\top X_F)^{-1} X_F^\top\|_2^2 = e_j^\top (X_F^\top X_F)^{-1} e_j \le n^{-1}\rho_-(k)^{-1}$, we have for all $\epsilon > 0$:

$$P[|\tilde{w}_j - \bar{w}_j| \ge \epsilon] \le 2\exp[-n\rho_-(k)\epsilon^2/(2\sigma^2)].$$

Taking the union bound for $j = 1, \ldots, d$ (each with probability $\eta/d$) we obtain the desired inequality. ■
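
To make the last step explicit, the following worked step is a sketch showing how the two tail bounds turn into the thresholds stated in the lemma. It assumes $d = p$ for the number of features, as the surrounding text uses both symbols.

```latex
% Give each of the d coordinates failure probability eta/d and solve for epsilon.
\begin{align*}
j \notin F:&\quad 2\exp\!\left[-\frac{n\epsilon^2}{2\sigma^2\rho_+(1)}\right] = \frac{\eta}{d}
  \;\Longleftrightarrow\; \epsilon = \sigma\sqrt{2\rho_+(1)\ln(2d/\eta)/n}, \\
j \in F:&\quad 2\exp\!\left[-\frac{n\rho_-(k)\epsilon^2}{2\sigma^2}\right] = \frac{\eta}{d}
  \;\Longleftrightarrow\; \epsilon = \sigma\sqrt{2\rho_-(k)^{-1}\ln(2d/\eta)/n}.
\end{align*}
% Each index j is either in F or not, so exactly d events are combined
% and the total failure probability is at most eta.
```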

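Lemma 4 Consider $G \subset \{1, \ldots, d\}$ such that $F \cap G = \emptyset$. Let $\hat{w} = \hat{w}^{(\ell)}$ be the solution of (6), and
let $\Delta\hat{w} = \hat{w} - \tilde{w}$. Let $\lambda_G = \min_{j \in G} \lambda_j^{(\ell-1)}$ and $\lambda_0 = \max_j \lambda_j^{(\ell-1)}$. If $2\|\hat{\epsilon}\|_\infty < \lambda_G$, then

$$\sum_{j \in G} |\hat{w}_j| \le \frac{2\|\hat{\epsilon}\|_\infty}{\lambda_G - 2\|\hat{\epsilon}\|_\infty} \sum_{j \notin F \cup G} |\hat{w}_j| + \frac{\lambda_0}{\lambda_G - 2\|\hat{\epsilon}\|_\infty} \sum_{j \in F} |\Delta\hat{w}_j| \le \frac{\lambda_0}{\lambda_G - 2\|\hat{\epsilon}\|_\infty} \|\Delta\hat{w}_{G^c}\|_1.$$

Proof For simplicity, let $\lambda_j = \lambda_j^{(\ell-1)}$. The first order equation implies that

$$\frac{1}{n} \sum_{i=1}^{n} 2(x_i^\top \hat{w} - y_i) x_{i,j} + \lambda_j \mathrm{sgn}(\hat{w}_j) = 0,$$

where $\mathrm{sgn}(\hat{w}_j) = 1$ when $\hat{w}_j > 0$, $\mathrm{sgn}(\hat{w}_j) = -1$ when $\hat{w}_j < 0$, and $\mathrm{sgn}(\hat{w}_j) \in [-1, 1]$ when $\hat{w}_j = 0$.
This implies that for all $v \in R^p$, we have

$$2v^\top A \Delta\hat{w} \le -2v^\top \hat{\epsilon} - \sum_{j=1}^{p} \lambda_j v_j \mathrm{sgn}(\hat{w}_j). \qquad (10)$$

Now, let $v = \Delta\hat{w}$ in (10), and notice that $\hat{\epsilon}_F = 0$, we obtain

$$\begin{aligned}
0 \le 2\Delta\hat{w}^\top A \Delta\hat{w} &\le 2|\Delta\hat{w}^\top \hat{\epsilon}| - \sum_{j=1}^{p} \lambda_j \Delta\hat{w}_j \mathrm{sgn}(\hat{w}_j) \\
&\le 2\|\Delta\hat{w}_{F^c}\|_1 \|\hat{\epsilon}\|_\infty - \sum_{j \in F} \lambda_j \Delta\hat{w}_j \mathrm{sgn}(\hat{w}_j) - \sum_{j \notin F} \lambda_j \Delta\hat{w}_j \mathrm{sgn}(\hat{w}_j) \\
&\le 2\|\Delta\hat{w}_{F^c}\|_1 \|\hat{\epsilon}\|_\infty + \sum_{j \in F} \lambda_j |\Delta\hat{w}_j| - \sum_{j \notin F} \lambda_j |\hat{w}_j| \\
&\le \sum_{j \in G} (2\|\hat{\epsilon}\|_\infty - \lambda_G) |\hat{w}_j| + \sum_{j \notin G \cup F} 2\|\hat{\epsilon}\|_\infty |\hat{w}_j| + \sum_{j \in F} \lambda_0 |\Delta\hat{w}_j|.
\end{aligned}$$

By rearranging the above inequality, we obtain the first desired bound. The second inequality uses
$2\|\hat{\epsilon}\|_\infty \le \lambda_0$. ■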

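Lemma 5 Using the notations of Lemma 4, let $J$ be the indices of the largest $s$ coefficients
(in absolute value) of $\hat{w}_G$. Let $I = G^c \cup J$ and $k = |G^c|$. If $0 \le \lambda_0/(\lambda_G - 2\|\hat{\epsilon}\|_\infty) \le 3$, then

$$\|\Delta\hat{w}\|_2 \le (1 + (3k/s)^{0.5}) \|\Delta\hat{w}_I\|_2.$$

Proof Using $\lambda_0/(\lambda_G - 2\|\hat{\epsilon}\|_\infty) \le 3$, we obtain from Lemma 4

$$\|\hat{w}_G\|_1 \le 3\|\Delta\hat{w} - \hat{w}_G\|_1.$$

Therefore

$$\begin{aligned}
\|\Delta\hat{w} - \Delta\hat{w}_I\|_\infty &\le \|\Delta\hat{w}_J\|_1 / s \\
&= s^{-1} \left[ \|\Delta\hat{w}_G\|_1 - \|\Delta\hat{w} - \Delta\hat{w}_I\|_1 \right] \\
&\le s^{-1} \left[ 3\|\Delta\hat{w} - \hat{w}_G\|_1 - \|\Delta\hat{w} - \Delta\hat{w}_I\|_1 \right],
\end{aligned}$$

which implies that

$$\begin{aligned}
\|\Delta\hat{w} - \Delta\hat{w}_I\|_2 &\le \left( \|\Delta\hat{w} - \Delta\hat{w}_I\|_1 \|\Delta\hat{w} - \Delta\hat{w}_I\|_\infty \right)^{1/2} \\
&\le \left[ \|\Delta\hat{w} - \Delta\hat{w}_I\|_1 \left( 3\|\Delta\hat{w} - \hat{w}_G\|_1 - \|\Delta\hat{w} - \Delta\hat{w}_I\|_1 \right) \right]^{1/2} s^{-1/2} \\
&\le \left[ \left( 3\|\Delta\hat{w} - \hat{w}_G\|_1 / 2 \right)^2 \right]^{1/2} s^{-1/2} \\
&\le (3/2) s^{-1/2} \|\Delta\hat{w} - \hat{w}_G\|_1 \\
&\le (3/2) s^{-1/2} k^{1/2} \|\Delta\hat{w} - \hat{w}_G\|_2 \le (3k/s)^{1/2} \|\Delta\hat{w}_I\|_2.
\end{aligned}$$

The third inequality uses the simple algebraic inequality $a(3b - a) \le (3b/2)^2$. By rearranging this
inequality, we obtain the desired bound. Note that in the above derivation, we have used the fact
that $F \cap G = \emptyset$, which implies that $\Delta\hat{w}_G = \hat{w}_G$, and thus $\Delta\hat{w} - \hat{w}_G = \Delta\hat{w}_{G^c}$. ■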
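
The algebraic inequality and the final constant in the proof above can be checked directly. The following worked step is a sketch, with $a = \|\Delta\hat{w} - \Delta\hat{w}_I\|_1$ and $b = \|\Delta\hat{w} - \hat{w}_G\|_1$ as the roles the two symbols play in the derivation.

```latex
% Completing the square gives the inequality a(3b - a) <= (3b/2)^2:
\[
\left(\tfrac{3b}{2}\right)^2 - a(3b - a)
  = a^2 - 3ab + \tfrac{9b^2}{4}
  = \left(a - \tfrac{3b}{2}\right)^2 \ge 0 .
\]
% The last step of the chain only loosens the constant:
\[
\tfrac{3}{2}\, s^{-1/2} k^{1/2} = \left(\tfrac{9k}{4s}\right)^{1/2} \le \left(\tfrac{3k}{s}\right)^{1/2} .
\]
```
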
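Lemma 6 Let the conditions of Lemma 4 and Lemma 5 hold, and let $k = |G^c|$. If $t = 1 - \pi(k + s, s) k^{1/2} s^{-1} \in (0, 4/3)$,
and $0 \le \lambda_0/(\lambda_G - 2\|\hat{\epsilon}\|_\infty) \le (4 - t)/(4 - 3t)$, then

$$\|\Delta\hat{w}\|_2 \le (1 + (3k/s)^{0.5}) \|\Delta\hat{w}_I\|_2 \le \frac{1 + (3k/s)^{0.5}}{t\rho_-(k+s)} \left[ 2\|\hat{\epsilon}_{G^c}\|_2 + \Big( \sum_{j \in F} (\lambda_j^{(\ell-1)})^2 \Big)^{1/2} \right].$$

Proof Let $J$ be the indices of the largest $s$ coefficients (in absolute value) of $\hat{w}_G$, and $I = G^c \cup J$.
The conditions of the lemma imply that

$$\begin{aligned}
\max(0, \Delta\hat{w}_I^\top A \Delta\hat{w}) &\ge \rho_-(k+s) \left[ \|\Delta\hat{w}_I\|_2 - \pi(k+s, s) \|\hat{w}_G\|_1 / s \right] \|\Delta\hat{w}_I\|_2 \\
&\ge \rho_-(k+s) \left[ 1 - (1 - t)(4 - t)(4 - 3t)^{-1} \right] \|\Delta\hat{w}_I\|_2^2 \\
&\ge 0.5 t \rho_-(k+s) \|\Delta\hat{w}_I\|_2^2.
\end{aligned}$$

In the above derivation, the first inequality is due to Lemma 2; the second inequality is due to the
conditions of this lemma plus Lemma 4, which implies that

$$\|\hat{w}_G\|_1 \le \frac{\lambda_0}{\lambda_G - 2\|\hat{\epsilon}\|_\infty} \|\Delta\hat{w}_{G^c}\|_1 \le \frac{\lambda_0}{\lambda_G - 2\|\hat{\epsilon}\|_\infty} \sqrt{k} \|\Delta\hat{w}_I\|_2;$$

and the last inequality follows from $1 - (1 - t)(4 - t)(4 - 3t)^{-1} \ge 0.5t$, which holds for $t \in (0, 4/3)$.

If $\Delta\hat{w}_I^\top A \Delta\hat{w} \le 0$, then the above inequality, together with Lemma 5, imply the lemma. Therefore
in the following, we can assume that

$$\Delta\hat{w}_I^\top A \Delta\hat{w} \ge 0.5 t \rho_-(k+s) \|\Delta\hat{w}_I\|_2^2.$$

Moreover, let $\lambda_j = \lambda_j^{(\ell-1)}$. We obtain from (10) with $v = \Delta\hat{w}_I$ the following:

$$\begin{aligned}
2\Delta\hat{w}_I^\top A \Delta\hat{w} &\le -2\Delta\hat{w}_I^\top \hat{\epsilon} - \sum_{j \in I} \lambda_j \Delta\hat{w}_j \mathrm{sgn}(\hat{w}_j) \\
&= -2\Delta\hat{w}_I^\top \hat{\epsilon}_{G^c} - 2\Delta\hat{w}_I^\top \hat{\epsilon}_G - \sum_{j \in F} \lambda_j \Delta\hat{w}_j \mathrm{sgn}(\hat{w}_j) - \sum_{j \in J} \lambda_j |\Delta\hat{w}_j| - \sum_{j \in F^c \cap G^c} \lambda_j |\Delta\hat{w}_j| \\
&\le 2\|\Delta\hat{w}_I\|_2 \|\hat{\epsilon}_{G^c}\|_2 + 2\|\hat{\epsilon}_G\|_\infty \sum_{j \in J} |\Delta\hat{w}_j| + \sum_{j \in F} \lambda_j |\Delta\hat{w}_j| - \sum_{j \in J} \lambda_j |\Delta\hat{w}_j| \\
&\le 2\|\Delta\hat{w}_I\|_2 \|\hat{\epsilon}_{G^c}\|_2 + \Big( \sum_{j \in F} \lambda_j^2 \Big)^{1/2} \|\Delta\hat{w}_I\|_2.
\end{aligned}$$

Note that the equality uses the fact that $G \subset F^c$, and $\Delta\hat{w}_j \mathrm{sgn}(\hat{w}_j) = |\hat{w}_j|$ for $j \in F^c$. The last
inequality uses the fact that $\forall j \in G$: $\lambda_j \ge \lambda_G \ge 2\|\hat{\epsilon}_G\|_\infty$. Now by combining the above two
estimates, we obtain

$$\|\Delta\hat{w}_I\|_2 \le \frac{1}{t\rho_-(k+s)} \left[ 2\|\hat{\epsilon}_{G^c}\|_2 + \Big( \sum_{j \in F} \lambda_j^2 \Big)^{1/2} \right].$$

The desired bound follows from Lemma 5. ■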
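
The scalar inequality used in the proof of Lemma 6 follows from a short calculation. The worked step below is a sketch that only assumes $t \in (0, 4/3)$, so that both $t$ and $4 - 3t$ are positive.

```latex
\[
1 - \frac{(1-t)(4-t)}{4-3t}
 = \frac{(4-3t) - (4 - 5t + t^2)}{4-3t}
 = \frac{t(2-t)}{4-3t},
\]
\[
\frac{t(2-t)}{4-3t} \ge \frac{t}{2}
 \iff 2(2-t) \ge 4 - 3t
 \iff t \ge 0 \qquad (\text{using } t > 0,\ 4 - 3t > 0).
\]
```
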
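Lemma 7 Let $\lambda_j = \lambda I(|\hat{w}_j| \le \theta)$ for some $\hat{w} \in R^p$, then

$$\Big( \sum_{j \in F} \lambda_j^2 \Big)^{1/2} \le \lambda \Big( \sum_{j \in F} I(|\bar{w}_j| \le 2\theta) \Big)^{1/2} + \lambda \left| \{ j \in F : |\bar{w}_j - \hat{w}_j| \ge \theta \} \right|^{1/2}.$$

Proof By assumption, if $|\bar{w}_j - \hat{w}_j| \ge \theta$, then

$$I(|\hat{w}_j| \le \theta) \le 1 \le I(|\bar{w}_j - \hat{w}_j| \ge \theta);$$

otherwise, $I(|\hat{w}_j| \le \theta) \le I(|\bar{w}_j| \le 2\theta)$. It follows that the following inequality always holds:

$$I(|\hat{w}_j| \le \theta) \le I(|\bar{w}_j| \le 2\theta) + I(|\bar{w}_j - \hat{w}_j| \ge \theta).$$

The desired bound is a direct consequence of the above result and the 2-norm triangle inequality

$$\Big( \sum_j (x_j + \Delta x_j)^2 \Big)^{1/2} \le \Big( \sum_j x_j^2 \Big)^{1/2} + \Big( \sum_j \Delta x_j^2 \Big)^{1/2}.$$ ■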



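Lemma 8 Define $F^{(\ell)} = \{j : |\hat{w}_j^{(\ell)} - \bar{w}_j| \ge \theta\}$. Under the conditions of Theorem 1 we have for
all $s \ge 2k$:

$$\|\hat{w}^{(\ell)} - \tilde{w}\|_2 \le \frac{5.7\lambda}{\rho_-(1.5k+s)} \sqrt{|F^{(\ell-1)}|}$$

and

$$\sqrt{|F^{(\ell)}|} \le \frac{6\lambda\theta^{-1}}{\rho_-(1.5k+s)} \sqrt{|F^{(\ell-1)}|}.$$

Proof For all $t \in [0.5, 4/3)$, by using Lemma 3, we know that the condition of the theorem implies
that

$$\frac{\lambda}{\lambda - 2\|\hat{\epsilon}\|_\infty} \le 7/5 \le \frac{4 - t}{4 - 3t}.$$

Moreover, Lemma 1 implies that the condition

$$0.5 \le t = 1 - \pi(1.5k + s, s)(1.5k)^{0.5}/s$$

is also satisfied. This means that the conditions of Lemma 6 (with $\lambda_0 = \lambda_G = \lambda$) are satisfied.
Now, we assume that at some $\ell \ge 1$,

$$|G_\ell^c| \le 1.5k, \quad \text{where } G_\ell = \{j \notin F : \lambda_j^{(\ell-1)} = \lambda\}, \qquad (11)$$

then it is easy to verify that $G_\ell^c \setminus F \subset F^{(\ell-1)}$.
Moreover, with the definition of $G = G_\ell$ in Lemma 6 (using the notation of Lemma 4), we can set $\lambda_0 = \lambda_G = \lambda$
and obtain (note also that $\hat{\epsilon}_F = 0$)

$$\begin{aligned}
\|\hat{w}^{(\ell)} - \tilde{w}\|_2
&\le \frac{1 + \sqrt{3}}{t\rho_-(1.5k+s)} \left[ 2\|\hat{\epsilon}_{G_\ell^c \setminus F}\|_2 + \Big( \sum_{j \in F} (\lambda_j^{(\ell-1)})^2 \Big)^{1/2} \right] \\
&\le \frac{1 + \sqrt{3}}{t\rho_-(1.5k+s)} \left[ 2\sqrt{|F^{(\ell-1)} \setminus F|} \, \|\hat{\epsilon}\|_\infty + \sqrt{|F^{(\ell-1)} \cap F|} \, \lambda \right] \\
&\le \frac{1 + \sqrt{3}}{t\rho_-(1.5k+s)} \left[ (2/7)\sqrt{|F^{(\ell-1)} \setminus F|} + \sqrt{|F^{(\ell-1)} \cap F|} \right] \lambda \\
&\le \frac{1 + \sqrt{3}}{0.5\rho_-(1.5k+s)} \, \lambda \sqrt{1.082 \, |F^{(\ell-1)}|} \\
&\le \frac{5.7\lambda}{\rho_-(1.5k+s)} \sqrt{|F^{(\ell-1)}|},
\end{aligned}$$

where the first inequality is due to Lemma 6. The second inequality uses the facts that $G_\ell^c \setminus F \subset F^{(\ell-1)} \setminus F$,
and Lemma 7 with $I(|\bar{w}_j| \le 2\theta) = 0$ (for all $j \in F$). The third inequality uses
$2\|\hat{\epsilon}\|_\infty \le (2/7)\lambda$, and the fourth inequality uses $(2/7)a + b \le \sqrt{1.082(a^2 + b^2)}$ together with $t \ge 0.5$.
Since Lemma 3 implies that

$$\|\tilde{w} - \bar{w}\|_\infty \le (1/7)\lambda/\sqrt{\rho_+(1)\rho_-(k)},$$

we know that $j \in F^{(\ell)}$ implies that

$$|\tilde{w}_j - \hat{w}_j^{(\ell)}| \ge \theta - (1/7)\lambda/\sqrt{\rho_+(1)\rho_-(k)} \ge (41/42)\theta.$$

Therefore

$$\begin{aligned}
\sqrt{|F^{(\ell)}|} &\le (41\theta/42)^{-1} \|\tilde{w} - \hat{w}^{(\ell)}\|_2 \\
&\le \frac{5.7\lambda (41\theta/42)^{-1}}{\rho_-(1.5k+s)} \sqrt{|F^{(\ell-1)}|} \\
&\le \frac{6\lambda\theta^{-1}}{\rho_-(1.5k+s)} \sqrt{|F^{(\ell-1)}|}.
\end{aligned}$$

That is, under the assumption of (11), the lemma holds at $\ell$.

Therefore next we only need to prove by induction on $\ell$ that (11) holds for all $\ell = 1, 2, \ldots$.
When $\ell = 1$, we have $G_1^c = F$, which implies that (11) holds.

Now assume that (11) holds at $\ell$ for some $\ell \ge 1$. Then by the induction hypothesis we know
that the lemma holds at $\ell$. This means that

$$\begin{aligned}
\sqrt{|G_{\ell+1}^c \setminus F|} &\le \sqrt{|F^{(\ell)}|} \\
&\le \frac{6\lambda\theta^{-1}}{\rho_-(1.5k+s)} \sqrt{|F^{(\ell-1)}|} \\
&\le \sqrt{0.5 \, |F^{(\ell-1)}|} \\
&\le \cdots \le 0.5^{\ell/2} \sqrt{|F^{(0)}|}.
\end{aligned}$$

The first inequality is due to the fact $G_{\ell+1}^c \setminus F \subset F^{(\ell)}$. The second inequality is the lemma at $\ell$, and
the third uses the assumption on $\theta$ in the theorem. The last inequality uses induction. Now note that $F^{(0)} = F$, we thus have
$|G_{\ell+1}^c \setminus F| \le 0.5k$. This completes the induction step. ■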
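
The numerical constants in the proof of Lemma 8 can be checked by hand. The worked step below is a sketch, with $a = \sqrt{|F^{(\ell-1)} \setminus F|}$ and $b = \sqrt{|F^{(\ell-1)} \cap F|}$, so that $a^2 + b^2 = |F^{(\ell-1)}|$; the decimal values are rounded.

```latex
% Fourth inequality: Cauchy-Schwarz on the vectors (2/7, 1) and (a, b).
\[
\tfrac{2}{7}a + b \le \sqrt{\left(\tfrac{2}{7}\right)^2 + 1}\,\sqrt{a^2 + b^2}
 = \sqrt{\tfrac{53}{49}\,(a^2+b^2)} \le \sqrt{1.082\,(a^2+b^2)},
 \qquad \tfrac{53}{49} \approx 1.0816 .
\]
% Leading constants, using t >= 0.5:
\[
\frac{1+\sqrt{3}}{0.5}\,\sqrt{1.082} \approx 5.464 \times 1.0402 \approx 5.684 \le 5.7,
\qquad
5.7 \times \tfrac{42}{41} \approx 5.84 \le 6 .
\]
% Contraction step: theta >= 9 lambda / rho_-(1.5k+s) gives
\[
\frac{6\lambda\theta^{-1}}{\rho_-(1.5k+s)} \le \tfrac{6}{9} \approx 0.667 \le \sqrt{0.5} \approx 0.707 .
\]
```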

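5.2 Proof of Theorem 1

Define

$$\beta = \frac{6\lambda\theta^{-1}}{\rho_-(1.5k+s)}.$$

We have $\beta < 1$ by the assumption of the theorem. Using induction, we have from Lemma 8 that

$$\sqrt{|F^{(L)}|} \le \beta\sqrt{|F^{(L-1)}|} \le \cdots \le \beta^L \sqrt{|F^{(0)}|} \le \beta^L \sqrt{k} < 1.$$

Since $|F^{(\ell)}|$ is an integer and the contraction in Lemma 8 continues to hold at later stages, this means
that $|F^{(\ell)}| = 0$ when $\ell \ge L$. Therefore by applying Lemma 8 again we obtain, for $\ell > L$,

$$\|\hat{w}^{(\ell)} - \tilde{w}\|_2 = 0.$$

Since Lemma 3 implies that

$$\|\tilde{w} - \bar{w}\|_\infty \le (1/7)\lambda/\sqrt{\rho_+(1)\rho_-(k)} < \theta,$$

and $\min_{j \in F} |\bar{w}_j| > 2\theta$, we have

$$\mathrm{supp}(\tilde{w}) = \mathrm{supp}(\bar{w}).$$

This implies that $\mathrm{supp}(\hat{w}^{(\ell)}) = \mathrm{supp}(\bar{w})$. ■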
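
The contraction argument above translates into a concrete stage count. The sketch below assumes the rule $L = \lfloor 0.5 \ln k / \ln(1/\beta) \rfloor + 1$, which is the smallest integer satisfying $\beta^L \sqrt{k} < 1$; the function name `stage_count` and the example values $k = 30$, $\beta = 2/3$ are hypothetical and chosen only for illustration ($\beta \le 2/3$ is what the condition $\theta \ge 9\lambda/\rho_-(1.5k+s)$ yields).

```python
import math


def stage_count(k, beta):
    """Smallest integer L with beta**L * sqrt(k) < 1, i.e.
    L = floor(0.5 * ln(k) / ln(1 / beta)) + 1."""
    if not (0.0 < beta < 1.0) or k < 1:
        raise ValueError("need 0 < beta < 1 and k >= 1")
    return math.floor(0.5 * math.log(k) / math.log(1.0 / beta)) + 1


beta = 2.0 / 3.0
L = stage_count(k=30, beta=beta)
print(L, beta ** L * math.sqrt(30), beta ** (L - 1) * math.sqrt(30))
```

The count grows only logarithmically in $k$, which is the sense in which a small number of stages suffices.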

6 Discussion 

This paper investigated the performance of multi-stage convex relaxation for feature selection,
where it is shown that under RIP, the procedure can achieve unbiased feature selection. This result
complements that of [16], which studies the parameter estimation performance of multi-stage convex
relaxation. It also complements similar results obtained in [17] and [13] for different computational
procedures. One advantage of our result over that in [13] is that the multi-stage convex relaxation
method is provably efficient because the correct feature set can be obtained after no more than
$O(\log k)$ iterations. In comparison, a computational efficiency statement for the
path-following method of [13] remains open.